Data Pre-Processing to Train a Better Lithuanian-English MT System
Abstract
Example sentence with endings separated from stems:
Pried -as ir Protokol -as yr -a neatskiriam -a ši -o Susitar -imo dal -is.

System #2: Prefixes separated, endings replaced by tense and number feature values
System #3: Prefixes separated, all endings replaced by number feature values, and verb endings also by tense feature values
System #4: Prefixes separated, endings deleted

Because Lithuanian is a highly inflected language, words change form according to their grammatical function: the endings of nouns, pronouns, adjectives, numerals and verbs vary with grammatical features. English, by contrast, does not have such a rich feature system. This difference between the languages significantly affects word and phrase alignment when training an SMT system. Typically, one or two forms of an English word have to be aligned to more than ten different surface forms of the corresponding Lithuanian word. Lithuanian verbs also take prefixes that indicate negation and other semantic features, whereas English verbs have no such prefixes and express this information with modifying words. Because many word forms are much rarer in the corpus than others, a Lithuanian-English SMT system does not translate all word forms equally well, and translation from Lithuanian into English commonly produces many out-of-vocabulary words.

Chosen approach
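The pre-processing variants named above (prefixes separated, endings replaced by feature values, endings deleted) can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the input has already been segmented by a morphological analyser so that endings appear as separate tokens starting with "-", as in the example sentence. The function names, the "+SG" tag and the naive "ne" prefix rule are hypothetical choices made only for illustration.

```python
def delete_endings(segmented):
    """Drop every ending token (marked with a leading '-')."""
    return " ".join(t for t in segmented.split() if not t.startswith("-"))


def replace_endings(segmented, ending_to_tag):
    """Replace ending tokens with feature tags; unmapped endings stay as-is."""
    return " ".join(
        ending_to_tag.get(t, t) if t.startswith("-") else t
        for t in segmented.split()
    )


def separate_prefix(word, prefixes=("ne",), min_stem_len=3):
    """Naively split a known prefix from a word.

    Hypothetical rule: a real system would consult a morphological
    analyser, since plain string matching also splits words that merely
    begin with the same letters.
    """
    lower = word.lower()
    for p in prefixes:
        if lower.startswith(p) and len(word) - len(p) >= min_stem_len:
            return word[:len(p)] + "- " + word[len(p):]
    return word


if __name__ == "__main__":
    # Example sentence from the text, without the final full stop.
    sentence = ("Pried -as ir Protokol -as yr -a neatskiriam -a "
                "ši -o Susitar -imo dal -is")
    # Hypothetical ending-to-feature map covering only two endings.
    tags = {"-as": "+SG", "-is": "+SG"}
    print(delete_endings(sentence))
    print(replace_endings(sentence, tags))
    print(separate_prefix("neatskiriam"))
```

Deleting or abstracting endings collapses many Lithuanian surface forms onto one token, which is the kind of reduction the text describes as helping alignment against the one or two forms of the English word.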
Similar resources
Language Model Data Augmentation for Keyword Spotting in Low-Resourced Training Conditions
This research extends our earlier work on using machine translation (MT) and word-based recurrent neural networks to augment language model training data for keyword search in conversational Cantonese speech. MT-based data augmentation is applied to two language pairs: English-Lithuanian and English-Amharic. Using filtered N-best MT hypotheses for language modeling is found to perform better th...
English-Lithuanian-English Machine Translation lexicon and engine: current state and future work
Gintaras Barisevičius, Bronius Tamulynas (Kaunas University of Technology). This article overviews the current state of the English-Lithuanian-English machine translation system. The first part of the article describes the problems that system poses today and what actions will be taken to solve them in...
Evaluation Methodology and Results for English-to-Arabic MT
This paper describes the evaluation campaign of the MEDAR project for English-to-Arabic (EnAr) MT systems. The campaign aimed at establishing some basic facts about the state of the art for MT on EnAr, collecting enough data to better train and tune systems and assessing the improvements made. The paper details the data used and their formats, the evaluation methodology and the results obtained...
The MIT-LL/AFRL IWSLT-2010 MT system
This paper describes the MIT-LL/AFRL statistical MT system and the improvements that were developed during the IWSLT 2010 evaluation campaign. As part of these efforts, we experimented with a number of extensions to the standard phrase-based model that improve performance on the Arabic and Turkish to English translation tasks. We also participated in the new French to English BTEC and English t...
Teaching MT Through Pre-editing: Three Case Studies
This article reports on three cases of teaching translation or English as a foreign language using pre-editing tasks with a machine translation system. Trainee translators or English learners were asked to input a Chinese or English paragraph into an MT system, observe the irregularities in the output, and subsequently edit the source text and input it again in the hope of getting better output...